Abstract. A new method of estimating some statistical characteristics of 
TCP flows in the Internet is developed in this paper. For this purpose, a new 
set of random variables (referred to as observables) is defined. When dealing 
with sampled traffic, these observables can easily be computed from sampled 
data. By adopting a convenient mouse / elephant dichotomy also dependent on 
traffic, it is shown how these variables give a reliable statistical representation 
of the number of packets transmitted by large flows during successive time 
intervals with an appropriate duration. A mathematical framework is devel- 
oped to estimate the accuracy of the method. As an application, it is shown 
how one can estimate the number of large TCP flows when only sampled traf- 
fic is available. The algorithm proposed is tested against experimental data 
collected from different types of IP networks. 
1. Introduction 

In Internet traffic a flow is classically defined as the set of those packets with 
the same source and destination IP addresses together with the same source and 
destination port numbers and the same protocol type. It is well known that, while large 
TCP flows carry the prevalent part of traffic (in bytes), most flows are small (in 
number of packets). A formal definition of "large" and "small" will be given later 
in the paper. As will be seen, it may depend on the context; as a first step, the 
discussion is kept informal. 
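
As an illustration of the flow definition above, the following minimal sketch groups packets by their 5-tuple and counts the packets of each flow. The `Packet` record, its field names and the sample addresses are hypothetical and only serve the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; the field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def flow_sizes(packets: Iterable[Packet]) -> Counter:
    """Count packets per flow, a flow being identified by its 5-tuple."""
    return Counter(
        (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol) for p in packets
    )


packets = [
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, "TCP"),
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, "TCP"),
    Packet("10.0.0.3", "10.0.0.2", 5678, 80, "TCP"),
]
sizes = flow_sizes(packets)  # two flows: one with 2 packets, one with 1 packet
```
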

We investigate in this paper how to characterize the statistical properties of 
the sizes of large flows (notably their number of packets) in Internet traffic. It 
is commonly observed in the technical literature and in real experiments that the 
total size (in packets or bytes) of such flows has a heavy tailed distribution. In 
practice, however, this characterization holds only for very large values of the flow 
size. Consequently, in order to accurately estimate the tail of the size probability 
distribution, a large number of large flows is necessary. To increase the sample size 
when empirically estimating probability distribution tails, one is led to increase the 
length of the observation period. But the counterpart is that the distribution of 
the flow size can no longer be described in terms of simple probability distributions, 
of the Pareto type for example. This is due to the fact that traffic is not stationary 
over long time periods, for instance because of daily variations of interactive services 
(video, web, etc.). 

Actually, numerous approaches have been proposed in the technical literature in 
order to model large flows as well as their superposition properties. One can roughly 
classify them in two categories: signal processing models and statistical models. 
Using ideas from signal processing, Abry and Veitch pQ , see also Feldman et al. [TU 
[15] and Crovella and Bestavros [8], describe the spectral properties of the time series 
associated with IP traffic by using wavelets. In this way, a characterization of long 
range dependence (the Hurst parameter for example) can be provided. Straight lines 
in the log-log plot of the power spectrum support some of the "fractal" properties 
of the IP traffic, even if they may simply be due to packet bursts in data flows. See 
Rolland et al. [25] . 

Signal processing tools provide information on aggregated traffic but not on the 
characteristics of individual TCP flows, like the number of packets or their transmission 
time. For statistical models, a representation with Poisson shot noise processes (and 
therefore some independence properties) has been used to describe the dynamics of 
IP traffic, see Hohn and Veitch [17], Duffield et al. [11], Gong et al. [16], Barakat 
et al. [4] and Krunz and Makowski [18] for example. In Ben Azzouna et al. [3], 
Loiseau et al. (2Ql [19] and Gong et al. [16], the distribution of the size of large flows 
is represented by a Pareto distribution, i.e. a probability distribution whose tail 
decays on a polynomial scale. 

The starting point of some of these analyses is the need for understanding the 
relation between the distribution of the number $\hat{S}$ of sampled packets when 
performing packet sampling and the distribution of the flow size $S$. The problem can 
be described as follows: $P(\hat{S} = j) = Q(P(S = \cdot), j)$, $j \ge 1$, where, for a 
distribution $\phi$ of the flow size, $Q(\phi, j)$ denotes the resulting probability that 
$j$ packets of the flow are sampled. 

The problem then consists of finding a distribution $\phi_0$ maximizing some functional 
$\mathcal{L}(\phi)$ so that the relation $P(\hat{S} = j) = Q(\phi_0, j)$ holds. See Loiseau et al. [19] for an 
extensive discussion of the current literature where our algorithm is called "stochastic 
counting". As will be seen in the following, our approach does not rely on the maximum 
likelihood ratio of distributions but on estimates of some averages 
to estimate some key parameters. 
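
To make the relation between the flow-size distribution and the distribution of the number of sampled packets concrete, the sketch below assumes that each packet is sampled independently with probability p (binomial thinning). This is an assumption made for illustration only, not necessarily the exact form used in the references above; the function name and the dictionary representation of the distribution are hypothetical.

```python
from math import comb


def sampled_size_distribution(phi, p):
    """Distribution of the number of sampled packets of a flow.

    phi: dict mapping a flow size (int) to its probability.
    p:   probability that any given packet is sampled (assumed independent).
    Returns a dict mapping j to P(j packets of the flow are sampled).
    """
    out = {}
    for size, prob in phi.items():
        for j in range(size + 1):
            # Binomial probability of sampling j packets out of `size`.
            weight = comb(size, j) * p**j * (1 - p) ** (size - j)
            out[j] = out.get(j, 0.0) + prob * weight
    return out


# A flow of exactly 2 packets, each sampled with probability 1/2.
dist = sampled_size_distribution({2: 1.0}, 0.5)
```

The inverse problem discussed in the text goes the other way: recovering the flow-size distribution from the observed distribution of sampled counts.
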

Statistical Characterization Method. We develop in this paper an alternative 
method of obtaining a statistical description of the size of large flows in IP traffic 
by means of a Pareto distribution: Statistics are collected during successive time 
windows of limited length (instead of one single time window for the whole trace). It 
must be emphasized that this characterization in terms of a Pareto distribution does 
not rely on the asymptotic behavior of the tail distribution but only on statistics 
on some range of values for the sizes of flows. 

The advantage of the proposed method is that with a careful procedure, a simple 
statistical characterization is possible and seems to be quite reliable as shown by 
our experiments for various sets of traffic traces. The intuitive reason for 
considering short time periods is that on such time scales, flows exhibit only one major 
statistical mode (typically a Pareto behavior). In larger time windows, different 
modes due to the wide variety of flows and non-stationarity in IP traffic necessarily 
appear. (See Feldman et al. [15].) This approach allows us to establish a reliable 
statistical characterization of flows which is used to infer information from sampled 
traffic, as will be seen. The counterpart is that the distribution of the total size 
of a large flow (obtained when considering the complete traffic trace) cannot be 
obtained directly in this way since the trace is cut into small pieces. 

An algorithm is proposed to obtain the statistical representation of large flows 
when all the packets of the trace are available. The constants used in our 
algorithms are explicitly expressed as either universal constants (independent of traffic) 
or constants depending on traffic: length of the observation window, definition 
of TCP flows referred to as large flows, etc. The procedure invoked to estimate 
flow statistics should not depend on some hidden pre-processing of the trace. Our 
algorithms determine on-line the constants depending on the traffic. This is, in our 
view, one important aspect which is sometimes neglected in the technical literature. 

Application to Sampled Traffic. The basic motivation for developing a flow 
characterization method is to infer flow characteristics from sampled data. This 
is notably the case for sampling processes such as the 1-out-of-k sampling scheme 
implemented by Cisco's NetFlow [7], which greatly degrades information on flows. 
What we advocate in this paper is that it is still possible to infer relevant char- 
acteristics on flows from sampled data if some characteristics of the flow size can 
be confidently described by means of a simple Pareto distribution. By using the 
statistical representation described above, we propose a method of inferring the 
number of large flows from sampled traffic. 

The proposed method relies on a new set of random variables, referred to as 
observables and computed in successive time intervals with fixed length. Specif- 
ically, these random variables count the number of flows sampled once, twice or 
more in the successive observation windows. The properties of these variables can 
be obtained through simple characteristics, in particular mean values of variables 
instead of remote quantiles of the tail distribution, which are much more difficult 
to accurately estimate. By developing a convenient mathematical setting (Poisson 
approximation methods), it is moreover possible to show that quantities related 
to the observables under consideration are close to Poisson random variables with 
an explicit bound on the error. This Poisson approximation is the key result to 
estimate the total number of large flows. 

Organization of the paper. The organization of the paper is as follows. A 
statistical description of large TCP flows is presented in Section 2; this representation is 
tested against five exhaustive sets of traffic traces: three from the France Telecom 
(FT) commercial IP network carrying residential ADSL traffic and two others from 
the Abilene network. An algorithm is developed in this section to compute the 
characteristics of the Pareto distributions describing flows. In Section 3, some assumptions 
on sampled traffic are introduced and the observables for describing traffic are 
defined. The mathematical properties are analyzed in light of Poisson approximation 
methods in Section 4. The results developed in this section are crucial to infer 
the statistics of IP traffic from sampled data. Experiments with the five sets of 
sampled traces used in this paper are presented and discussed in Section 5. Some 
concluding remarks are presented in Section 6. 

2. Statistical Properties of Flows 

This section is devoted to a statistical study of the size (the number of packets) 
of flows in a limited time window of duration A. The goal of this section is to show 
that a simple statistical representation of the flow size can be obtained for various 
sets of traffic traces. 

2.1. Assumptions and Experimental Conditions. 

The sets of traces used for testing theoretical results. For the experiments carried 
out in the following sections, several sets of traces will be considered: commercial 
IP traffic, namely ADSL traces from the France Telecom (FT) IP collect network, 
and traffic issued from campus networks (Abilene III traces). Their characteristics 
are given in Table 1. 



Table 1. Characteristics of traffic traces considered in experiments. 

Name                       Nb. IP packets   Nb. TCP Flows   Duration
ADSL Trace A               271 455 718      20 949 331      2 hours
ADSL Trace B Upstream      54 396 226       2 648 193       2 hours
ADSL Trace B Downstream    53 391 874       2 107 379       2 hours
Abilene III Trace A        62 875 146       1 654 410       8 minutes
Abilene III Trace B        47 706 252       1 826 380       8 minutes

The Abilene traces 20040601-193121-1.gz (trace A) and 20040601-194000-0.gz 
(trace B) can be found at the url http://pma.nlanr.net/Traces/Traces/long/ipls/3/. 

Time Windows. Traffic will be observed in successive time windows with length 
A. In practice, the quantity A can vary from a few seconds to several minutes 
depending upon traffic characteristics on the link considered. 

The ideal value of A actually depends on the targeted application. For the 
design of network elements considering the flow level (e.g., flow aware routers, 
measurement devices, etc.), it is necessary to estimate the requirements in terms 
of memory to store the different flow descriptors. In this context, A may be of the 
order of a few seconds. The same order of magnitude is also suited to anomaly 
detection, for instance for detecting a sudden increase in the number of flows. For 
the computation of traffic matrices, A can be several minutes long (typically 15 
minutes). In our study, the "adequate" values for A are of the order of several 
seconds. See the discussion below. 
Mice and Elephants. With regard to the analysis of the composition of traffic, in 
light of earlier studies on IP traffic (see Estan and Varghese [13], Papagiannaki et 
al. [52] or Ben Azzouna et al. [5]), two types of flows are identified: small flows with 
few packets (referred to as mice) and the other flows (referred to as elephants). 
In commercial IP traffic, this simple traffic decomposition can be justified by the 
predominance of web browsing and peer-to-peer traffic giving rise to either signaling 
and very small file transfers (mice) or else file downloads (elephants). 

This dichotomy may be more delicate to verify in a different context than the 
one considered in Ben Azzouna et al. [2]. For LAN traffic, for example, there may 
be very large amounts of data transferred at very high speed. As will be seen 
in the various IP traces used in our analysis, the distinction between mice and 
elephants has to be handled with care and in our case is dependent on the type of 
traffic considered. The distinction between the constants depending on the trace 
and "universal" constants is, in our view, a crucial issue. It amounts to precisely 
stating which constants depend on traffic. This aspect is generally (unduly in 
our opinion) neglected in traffic measurement studies. In particular, the variable 
A and the dichotomy mice/elephants are dependent on the trace, as explained in 
the next section. 

2.2. Heavy Tails. The fact that the distribution of the size S of a large TCP flow is 
heavy tailed is well known. Experiments and theoretical results on the superposition 
of ON-OFF heavy tailed traffic have justified the self similar nature of IP traffic, 
see Crovella and Bestavros [5]. Although the heavy tailed property of the size of 
large flows is commonly admitted, little attention has been paid to properly identifying 
a class of heavy tailed distributions so that the corresponding parameters can be 
estimated for an arbitrary traffic trace with a significant duration. 

One of the reasons for this situation is that the most common heavy tailed 
distributions $G(x) = P(S > x)$ (e.g., Pareto, i.e., $G(x) = C/x^{a}$ for $x > b$ and some 
$a > 0$, or Weibull, i.e., $G(x) = \exp(-\nu x^{\beta})$ for some $\beta > 0$ and $\nu > 0$) have a 
very small number of parameters and consequently a limited number of 
degrees of freedom for describing the distribution of the sizes of flows. For this 
reason, such a distribution can rarely represent the statistics of the total number 
of packets transmitted by a flow in a trace of arbitrary duration. 

As a matter of fact, if a traffic trace is sufficiently long, some non stationary 
phenomena may arise and the diversity of file sizes may not be captured by one or 
two parameters. For example, with a Pareto distribution, the function $x \mapsto G(x)$ 
in a log-log scale should be a straight line. The statistics of the file sizes in the 
traces used in our experiments are depicted in Figures 1 and 2 for an ADSL traffic 
trace from the France Telecom backbone IP collect network and for a traffic trace 
from the Abilene network, respectively. 

Figures 1 and 2 clearly show that for the two traffic traces considered, the file size 
exhibits a multimodal behavior: several straight lines would be necessary 
to properly describe these distributions. These figures also exhibit the (intuitive) 
fact that has been noticed in earlier experiments: The longer the trace is, the 
more marked is the multimodal phenomenon. (See Ben Azzouna et al. [3] for a 
discussion.) 

The key observation when characterizing a traffic trace is the fact that if the 
duration A of the successive time intervals used for computing traffic parameters is 
appropriately chosen, then the distribution of the size of the main contributing flows 
in the time interval can be represented by a Pareto distribution. More precisely, 
there exist A, $B_{\min}$, $B_{\max}$ and $a > 0$ such that if $S$ is the number of packets 
transmitted by a flow in A time units, then $P(S \ge x \mid S \ge B_{\min}) \sim P_a(x)$ for 
$B_{\min} \le x \le B_{\max}$ with 

(1)    $P_a(x) \stackrel{\text{def}}{=} \left(\frac{B_{\min}}{x}\right)^{a}$, for $x \ge B_{\min}$, 

and furthermore the proportion of large flows with size greater than $B_{\max}$ is less 
than 5%. The parameter $B_{\min}$ is usually referred to as the location parameter and 
a as the shape parameter. 

Figure 1. Statistics of the number of packets S of a flow for ADSL 
A (2 hours): the quantity -log(P(S > x)) as a function of log(x). 

Figure 2. Statistics of the number of packets S of a flow for 
ABILENE A trace (8 minutes): the quantity -log(P(S > x)) as a 
function of log(x). 

In other words, if the time interval is sufficiently small then the distribution of 
the number of packets transmitted by a large flow has one dominant Pareto mode 
and therefore can confidently be characterized by a unique Pareto distribution. The 
algorithm used to validate this result is described in Table 2. It is run from the 
beginning of the trace; in practice a couple of minutes is sufficient to obtain results 
for the constants A, $B_{\min}$, $B_{\max}$. The algorithm is of course valid when the total 
trace is available for at least an interval of several minutes. In the case of sampled 
traffic for which this algorithm cannot be used, another method will be proposed 
in Section 3. 

Table 2. Algorithm for Identifying A and the Pareto Distribution. 

- A is fixed so that at least 1000 flows have more than 20 packets. 

- $B_{\max}$ is defined as the smallest integer such that less than 5% of the flows 
have a size greater than $B_{\max}$. 

- A Least Square Method, see Deuflhard and Hohmann [9] for example, is 
performed to get a linear interpolation in a log-log scale of the distribution 
of sizes between $B_{\min}$ and $B_{\max}$. The constant $B_{\min}$ is chosen as the 
smallest integer such that the $L^2$-distance in the sense of the least square method 
with the approximating straight line is less than $2 \cdot 10^{-3}$. The slope of the 
line gives the value of the parameter a. 
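
The last two steps of Table 2 can be sketched as follows, for flow sizes already collected in one time window (the choice of A is not covered here). Several details are assumptions made for the example rather than the authors' exact procedure: the fit is done on the empirical value of P(S >= x | S >= B_min) at integer points x, the "distance" is taken to be the mean squared residual of the log-log fit, and the function name `fit_pareto_range` is hypothetical. Sizes are assumed to be positive integers.

```python
import numpy as np


def fit_pareto_range(sizes, tail_fraction=0.05, tol=2e-3):
    """Return (b_min, b_max, a) for one window of flow sizes, or None."""
    sizes = np.sort(np.asarray(sizes, dtype=np.int64))
    n = sizes.size

    # b_max: smallest integer b such that fewer than 5% of the flows exceed b.
    b_max = int(sizes[0])
    while (n - np.searchsorted(sizes, b_max, side="right")) / n >= tail_fraction:
        b_max += 1

    # b_min: smallest integer for which the log-log fit is accurate enough.
    for b_min in range(int(sizes[0]), b_max - 1):
        xs = np.arange(b_min, b_max + 1)
        at_least = n - np.searchsorted(sizes, xs, side="left")  # #{S >= x}
        log_ccdf = np.log(at_least / at_least[0])  # log P(S >= x | S >= b_min)
        slope, intercept = np.polyfit(np.log(xs), log_ccdf, 1)
        residuals = log_ccdf - (slope * np.log(xs) + intercept)
        if np.mean(residuals**2) < tol:
            return b_min, b_max, -slope  # the shape parameter is minus the slope
    return None


# Synthetic check: integer parts of Pareto variables with location 10, shape 2.
rng = np.random.default_rng(0)
synthetic = np.floor(10.0 * (1.0 - rng.random(200_000)) ** (-0.5)).astype(np.int64)
result = fit_pareto_range(synthetic)
```

On such purely Pareto data the fit should accept the smallest candidate for b_min and return a shape close to 2; on real traces the loop typically has to discard the smallest sizes first.
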



The quantity $B_{\min}$ defines the boundary between mice and elephants in the 
trace. A mouse is a flow with a number of packets less than $B_{\min}$. An elephant is 
a flow such that its number of packets during a time interval of length A is greater 
than or equal to $B_{\min}$. By definition of $B_{\max}$, flows whose size is greater than 
$B_{\max}$ represent a small fraction of the elephants. 

2.3. Experiments with Synthetic and Real Traffic Traces. Some experiments 
have been done using artificial traces with a real Pareto distribution. For 
these traces, the algorithm described in Table 2 has been used without any 
modification: a time window is defined when at least 1000 flows of size greater than 20 
packets are detected. As can be seen in Figure 3, the identification of the exponent a is quite 
good. Note that, because only Pareto distributed flows are present, the minimal 
size $B_{\min}$ of elephants is smaller than in real traffic. 

Experimental results with real traces, for the ADSL A and Abilene A traffic 
traces, are displayed in Figures 4 and 5, respectively. The same algorithm has been 
run for the ADSL trace B Upstream and Downstream as well as for the Abilene 
III B trace. The benefit of the algorithm is that the distribution of the number of 
packets in elephants can always be represented by a unimodal Pareto distribution 
if the duration of A is adequately chosen by using the algorithm given in Table 2. 
Results are summarized in Table 3. 



Table 3. Statistics of the elephants for the different traffic traces. 

            ADSL A   ADSL B Up   ADSL B Down   Abilene A   Abilene B
A (sec)     5        15          15            2           2
B_min       20       29          39            89          79
B_max       94       154         128           324         312
a           1.85     1.97        1.50          1.30        1.28



2.4. On the choice of parameters. We discuss in this section the various 
parameters used by the algorithm. 

Figure 3. Synthetic traces with 10^6 flows with a Pareto distribution. 
Panel (b): Pareto a = 2.5; estimation: a = 2.48, Bmin = Hi, Bmax = 65. 

Fixed parameters and parameters depending on traffic. There are four basic 
parameters for the model which are determined by the trace: A (duration of time window 
for statistics), the range of values $[B_{\min}, B_{\max}]$ for the Pareto distribution and the 
exponent a of this distribution. These parameters are discussed below. 

Additionally, there are "universal" parameters (i.e. independent of the trace): the minimal 
number of flows to make statistics, set to 1000 here, the proportion, 5%, of flows of 
size greater than $B_{\max}$, and the level of accuracy, $2 \cdot 10^{-3}$ here, of the least square method to 
determine $B_{\min}$ and $B_{\max}$. 

Parameter $B_{\min}$. It turns out that for commercial (ADSL) traffic, the value of 
$B_{\min}$ is close to 20. This value is fairly common in earlier studies for classifying 
ADSL traffic. It should be noted that this value is not at all universal since, in 
our view, it does depend on traffic. The examples with Abilene traces, see below, 
which contain significantly bigger elephants, show that the corresponding values 
should be higher than 20 (around 80 in our example). 

The two types of traffic are intrinsically different: ADSL traffic is mainly 
composed of peer-to-peer traffic (with a huge number of small flows and a few file 
transfers of limited size because of the segmentation of large files into chunks), 
while Abilene traffic comprises large file transfers issued from campus networks. In 
order to maximize the range for the Pareto description, the variable $B_{\min}$ is defined 
as the smallest value for which the linear representation (in the log scale) holds. 

Figure 4. Statistics of the flow size (number of packets) in a time 
interval of length A = 15. Panels: (a) ADSL A trace, A = 5 s; (b) ADSL B Down 
trace, A = 15 s; (c) ADSL B Up trace, A = 15 s. 

Parameter A. This parameter A is determined in a simple way by our algorithm. 
According to the various experiments, the parameter A can be taken in some range 
of values where the Pareto representation still holds. On the one hand, A has to 
be taken large enough so that sufficiently many packets arrive in time intervals of 
duration A to derive reliable estimations of the Pareto distribution. An experiment 
with the ADSL A trace with A = 1 s gives only 63 flows of size more than 20, which 
is not enough to obtain reliable statistics. A "correct" value in this case is 5 s. 
Experiments show that higher values (like 10 s) do not change significantly the 
Pareto property observed in this case. 

On the other hand, A should not be too large so that the statistical properties 
(a Pareto distribution in our case) can be identified, i.e., so that the statistics are 
unimodal. See Figures 1 and 2, which illustrate situations where statistics are done 
on the complete trace, i.e. when A is taken equal to the total duration of the trace. 
In these examples, the piecewise linear aspect of the curves suggests that, in both cases, 
there is at least a bi-modal Pareto behavior. 

Figure 5. Statistics of the flow size (number of packets) in a time 
interval of length A for the Abilene traces. Panel (b): Abilene B trace, A = 2 s. 
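
The lower bound on A (at least 1000 flows with more than 20 packets in the window) can be sketched as a search over candidate durations. The representation of a trace as a list of `(timestamp, flow_id)` pairs, the candidate list and both function names are assumptions made for this example; the upper bound on A (unimodality) is not checked here.

```python
from collections import Counter


def count_large_flows(packets, start, window, min_packets=20):
    """Number of flows with more than `min_packets` packets in [start, start + window)."""
    counts = Counter(flow for t, flow in packets if start <= t < start + window)
    return sum(1 for c in counts.values() if c > min_packets)


def smallest_adequate_window(packets, candidates, start=0.0,
                             min_flows=1000, min_packets=20):
    """Smallest candidate duration with at least `min_flows` large flows, or None.

    `packets` must be a re-iterable sequence (e.g. a list), since it is
    scanned once per candidate duration.
    """
    for window in sorted(candidates):
        if count_large_flows(packets, start, window, min_packets) >= min_flows:
            return window
    return None


# Toy trace: two flows, each sending one packet per second for 10 seconds.
toy = [(float(t), flow) for t in range(10) for flow in ("a", "b")]
chosen = smallest_adequate_window(toy, [3.0, 10.0], min_flows=2, min_packets=5)
```
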

2.5. Discussion. As will be seen in the following, the above statistical model 
gives interesting results to extract information from sampled traffic. It nevertheless 
has some shortcomings which are now discussed. 

Partial information when A is small. It should be noted that the parameters 
computed in a time window of length A do not give a complete description of the 
distribution of the size of a large flow, since statistics are done over a limited time 
horizon. The procedure therefore provides fragmented information. 

To obtain a complete description of the statistics of the size of flows, it would 
be necessary to relate the statistics from successive time windows of length A. We 
do not know how to do that yet. Nevertheless, as will be seen in the following, 
this fragmented information can be recovered from sampled traffic and it will be 
used to give a good estimation of the number of active large flows at a given time. 
This incomplete but useful description of the statistics is, in some sense, the price 
to pay to have a simple estimation of the statistics of flows. 

An incomplete description of large flows in a time window of size A. The 
representation with a Pareto distribution is for elephants (with size greater than $B_{\min}$) 
whose size is less than $B_{\max}$. In particular, it does not give any information on 
the statistics of flows with size greater than $B_{\max}$. But note that, by definition, 
less than 5% of the total number of flows have a size greater than $B_{\max}$. This is 
however a source of errors when, as in Section 4, the Pareto representation is used 
on the interval $[B_{\min}, +\infty)$ instead of $[B_{\min}, B_{\max}]$. 

3. Sampled Traffic: Assumptions and Definition of Observables 

In the previous section, an algorithm to describe the distribution of large flows 
by means of a unimodal distribution has been introduced. Now, it is shown how 
to exploit this algorithm in the context of packet sampling in the Internet. Packet 
sampling is a crucial issue when performing traffic measurements in high speed 
backbone networks. As a matter of fact, a fundamental problem related to the 
computation of flow statistics from traffic crossing very high speed transmission 
links is that, due to the enormous number of packets handled by routers, only a 
reduced amount of information can be available to the network operator. 

Packet sampling is in this context an efficient method of reducing the volume 
of data to analyze when performing measurements in the Internet. One popular 
technique consists of picking one packet out of every $k_s$ packets, with $k_s$ = 100, 
500 or 1000 in practice. (This sampling scheme is referred to as 1-out-of-$k_s$ packet 
sampling in the technical literature.) This method is implemented for instance in 
Cisco routers, namely in the NetFlow facility [7], widely deployed in operational 
networks today. It suffers from different shortcomings well identified in the technical 
literature, see for instance Estan et al. [12]. 
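
A minimal sketch of deterministic 1-out-of-k sampling is given below. Selecting exactly the k-th, 2k-th, ... packets (rather than starting from a random offset) is an assumption of this example, and the function name is hypothetical; it is not a description of how any particular router implements the scheme.

```python
def sample_one_out_of_k(packets, k):
    """Keep every k-th packet of the stream (deterministic 1-out-of-k sampling)."""
    return [pkt for idx, pkt in enumerate(packets, start=1) if idx % k == 0]


# Ten packets numbered 1..10, sampled with k = 3.
sampled = sample_one_out_of_k(list(range(1, 11)), 3)
```
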

We describe in this section the different assumptions made on traffic in order to 
develop an analytical evaluation of our method of inferring flow statistics. Through- 
out this paper, high speed transmission links (at least 1 Gbit/s) will be considered. 

3.1. Mixing condition. When observing traffic, packets are assumed to be 
sufficiently interleaved so that packets of the same flow are not back-to-back but 
mixed with packets of other flows. This introduces some randomness in the 
selection of packets when performing sampling. In particular, when K flows are active 
in a given time window and if the ith flow comprises $v_i$ packets during that period, 
then the probability of selecting a packet of the ith flow is assumed to be equal to 
$v_i/(v_1 + v_2 + \cdots + v_K)$. This property will be referred to as the mixing condition in 
the following and is formally defined as follows. A variant of this property is, 
implicitly at least, assumed in the existing literature. See, e.g., Duffield et al. [10] and 
Chabchoub et al. [6]. 

Definition 1 (Mixing Condition). If K TCP flows are active during a time 
interval of duration A, traffic is said to be mixing if for all i, $1 \le i \le K$, the total 
number $\hat{v}_i$ of packets sampled from the ith flow during that time interval has the 
same distribution as the analogous variable in the following scenario: at each 
sampling instant a packet of the ith flow is chosen with probability $v_i/V$, where $v_i$ is the 
number of packets of the ith flow and $V = v_1 + \cdots + v_K$. 

This amounts to claiming that, with regard to sampling, the probability of selecting 
a packet of a given flow is proportional to the total number of packets of this flow. 

One alternative would consist of assuming that the probability of selecting a 
packet of the ith flow is 1/K, the inverse of the total number of flows. This 
assumption, however, does not take into account the respective contributions of 
the different flows to the total volume and thus may be inaccurate. If all K flows 
had the same distribution with a small variance, then this assumption would not 
differ much from the mixing condition. Note however that the variance of Pareto 
distributions can be infinite if the shape parameter a is less than 2. Hence, this 
leads us to suppose that the mixing condition holds and that the probability of 
selecting a packet from flow i is indeed $v_i/V$. 
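
The scenario used in the mixing condition can be simulated directly: at each sampling instant, a flow is drawn with probability proportional to its number of packets. The sketch below is only an illustration of that scenario; the function name, the dictionary of flow sizes and the seed are assumptions of the example.

```python
import random
from collections import Counter


def sample_counts_under_mixing(flow_sizes, n_samples, seed=None):
    """Draw n_samples sampled packets; flow i is hit with probability v_i / V.

    flow_sizes: dict mapping a flow identifier to its number of packets v_i.
    Returns a Counter mapping each flow to the number of times it was sampled.
    """
    rng = random.Random(seed)
    flows = list(flow_sizes)
    weights = [flow_sizes[f] for f in flows]
    return Counter(rng.choices(flows, weights=weights, k=n_samples))


# Two flows carrying 90% and 10% of the packets, 1000 sampling instants.
counts = sample_counts_under_mixing({"a": 900, "b": 100}, 1000, seed=1)
```

With the alternative 1/K rule discussed above, both flows would be sampled about equally often, regardless of their sizes.
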

3.2. Negligibility assumption. We consider traffic on very high speed links and 
it then seems reasonable to assume that no flows contribute a significant proportion 
of global traffic. In other words, we suppose that the contribution of a given flow 
to global traffic is negligible. In the following, we go one step further by assuming 
that in any time window, the number of packets of a given flow is negligible when 
compared to the total number of packets in the observation window. By using 
the notation of the previous section, this amounts to assuming that for any flow i, 
the number of packets $v_i$ is much less than $V$. Furthermore, we even impose that 
the squared value of $v_i$ is much less than $V$. We specifically formulate the above 
assumptions as follows. 

Definition 2 (Negligibility condition). In any window of length A, the square of 
the number of packets of every flow is negligible when compared to the total number 
of packets $V$ in the observation window. Specifically, there exists some $0 < \varepsilon \ll 1$ 
such that for all $i = 1, \ldots, K$, $v_i^2/V \le \varepsilon$. 

The above assumption implies that no flows are dominating when observing 
traffic on a high speed transmission link. Table 4 shows that this is the case for 
the traces used in our experiments. There is thus no bias in the sampling process, 
which could be caused by the fact that some flows are oversampled because they 
contribute a significant part of traffic. This assumption is reasonable for commercial 
ADSL traffic because access links are often the bottlenecks in the network. For 
instance, ADSL users may have access rates of a few Mbit/s, which are negligible 
when compared against backbone links of 1 to 10 Gbit/s. Moreover, the bit rate 
achievable by an individual flow rarely exceeds a few hundred kbit/s. In the 
case of transit networks carrying campus traffic, the above assumption may be 
more questionable since bulk data transfers may take place in Ethernet local area 
networks and individual flows may achieve bit rates of several Mbit/s. 
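
For one observation window, the quantities appearing in the negligibility condition can be computed as in the sketch below. The two function names are hypothetical, and the second function is only one plausible single-window reading of an averaged ratio; how averages are taken across windows is not specified here.

```python
def max_square_ratio(flow_sizes):
    """Worst case of v_i^2 / V over the flows of one window."""
    total = sum(flow_sizes)
    return max(v * v for v in flow_sizes) / total


def mean_square_ratio(flow_sizes):
    """Average of v_i^2 over the flows of one window, divided by V."""
    total = sum(flow_sizes)
    return sum(v * v for v in flow_sizes) / len(flow_sizes) / total


dominated = [10, 20, 70]    # one flow carries 70% of the packets
balanced = [1] * 10_000     # many flows of one packet each
```

In the first toy window the ratio is far above 1, so the condition fails; in the second it equals 1/10000, which is the regime the definition has in mind.
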

3.3. The Observables. We now introduce the different variables used to infer 
flow characteristics. These variables are based only upon sampled data; they can 
be evaluated when analyzing NetFlow records sent by routers of an IP network. 
For this reason, these variables are referred to as observables. Because of packet 
sampling, recall that the original characteristics of flows (for instance their duration 
or their original number of packets) cannot be directly observed. 
Table 4. The quantity $E(v_i^2)/E(V)$ for traffic traces considered in experiments. 

Trace           A = 5 sec    A = 10 sec   A = 15 sec
ADSL A          0.000146     0.000159     0.000168
ADSL B Up       0.001100                  0.001335
ADSL B Down     0.002199     0.002543     0.002732

Trace           A = 1 sec    A = 2 sec    A = 3 sec    A = 5 sec
Abilene A       0.055001     0.068833     0.064813     0.072768

Trace           A = 1 sec    A = 2 sec
Abilene B       0.011786     0.013804

The observables considered in this paper to infer flow characteristics are the
random variables $W_j$, $j \ge 1$, where $W_j$ is the number of flows sampled j times
during a time interval of duration A. The averages of the random variables $W_j$ are
in fact the key quantities used to infer the characteristics of flows from sampled
data.

The random variables $W_j$, $j \ge 1$, are formally defined as follows. Consider a
time interval of length A and let K be the total number of large flows present in
this time interval. Each flow $i \in \{1, \ldots, K\}$ is composed of $v_i$ packets in this time
interval. Denote by $\tilde v_i$ the number of times that flow i is sampled. The random
variable $W_j$ is simply defined by

(2)   $W_j = \mathbf{1}_{\{\tilde v_1 = j\}} + \mathbf{1}_{\{\tilde v_2 = j\}} + \cdots + \mathbf{1}_{\{\tilde v_K = j\}}$.

In practice, if A is not too large, the data structures used to compute the
variables $W_j$ are reasonably simple. Moreover, as will be seen in the following,
provided that A is appropriately chosen, the statistics of the number of packets
transmitted by elephants during successive time windows of duration A are
quite robust. Consequently, the variables $W_j$ also inherit this property. When the
number of large flows is large, the estimation of the asymptotics of their averages
from the sampled traffic is easy in practice. Theoretical results on these variables
are derived in the next section.
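
To make the definition concrete, the observables can be computed from a packet
stream in a few lines. This is a minimal sketch under stated assumptions: the
stream is given as a list of flow identifiers (one per packet, a hypothetical
input format), sampling is deterministic 1-out-of-k_s, and all sampled flows are
counted, since mice and elephants cannot be told apart from sampled data alone.

```python
from collections import Counter


def observables(packet_flow_ids, k_s):
    """Return a Counter mapping j to W_j, the number of flows sampled exactly j times."""
    # Deterministic sampling: keep the packets at positions k_s - 1, 2 * k_s - 1, ...
    sampled = packet_flow_ids[k_s - 1::k_s]
    times_sampled = Counter(sampled)          # flow id -> number of sampled packets
    return Counter(times_sampled.values())    # j -> W_j


# Eight packets from flows a, b, c, sampled 1-out-of-2.
w = observables(list("abacabaa"), 2)
```

In this toy stream the sampled packets belong to flows b, c, b and a, so one flow
is sampled twice and two flows are sampled once. Averaging such counters over
successive windows of duration A gives empirical estimates of the means $E(W_j)$.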

4. Mathematical Properties of the Observables 

4.1. Definitions and Le Cam's inequality. For $j > 0$, the variable $W_j$ defined
by Equation (2) is a sum of Bernoulli random variables, namely

$W_j = \mathbf{1}_{\{\tilde v_1 = j\}} + \mathbf{1}_{\{\tilde v_2 = j\}} + \cdots + \mathbf{1}_{\{\tilde v_K = j\}}$,

where $\tilde v_i$ is the number of times that the ith flow has been sampled. If these
indicator functions were independent, then, assuming that K is large, one could
estimate the distribution of $W_j$ either via a Poisson approximation (in a rare
event setting) or via a central limit theorem (in a law of large numbers context).
Since the total number of samples is known, the sum of the random variables $\tilde v_i$
for $i = 1, \ldots, K$ is known, and hence the Bernoulli variables defining $W_j$ are not
independent.

To overcome this problem, we make use of general results on the sum of Bernoulli
random variables. Let us consider a sequence $(I_i)$ of Bernoulli random variables,
i.e. $I_i \in \{0, 1\}$. The distance in total variation between the distribution of
$X = I_1 + \cdots + I_i + \cdots$ and a Poisson distribution with parameter $\delta > 0$ is defined by

$\|P(X \in \cdot) - P(Q_\delta \in \cdot)\|_{tv} \stackrel{\text{def}}{=} \sup_{A \subset \mathbb{N}} |P(X \in A) - P(Q_\delta \in A)| = \frac{1}{2} \sum_{n \ge 0} \left| P(X = n) - \frac{\delta^n}{n!} e^{-\delta} \right|$.

The Poisson distribution $Q_\delta$ with mean $\delta$ is such that

$P(Q_\delta = n) = \frac{\delta^n}{n!} \exp(-\delta)$.

Note that the total variation distance is a strong distance since it is uniform with
respect to all events, i.e., for all subsets A of $\mathbb{N}$,

$|P(X \in A) - P(Q_\delta \in A)| \le \|P(X \in \cdot) - P(Q_\delta \in \cdot)\|_{tv}$.

The following result (see Barbour et al. [5]) gives a tight bound on the total
variation distance between the distribution of X and the Poisson distribution with
the same expected value when the Bernoulli variables are independent. In spite of
the fact that this result is not directly applicable in our case, we shall show in the
following how to use it to obtain information on the distributions of the observables
$W_j$.

Theorem 1 (Le Cam's Inequality). If the random variables $(I_i)$ are independent
and if $X = \sum_i I_i$, then

(3)   $\|P(X \in \cdot) - P(Q_{E(X)} \in \cdot)\|_{tv} \le \sum_i P(I_i = 1)^2 = E(X) - \mathrm{Var}(X)$.

Since a Poisson random variable X satisfies Var(X) = E(X), the above relation shows
that, to prove convergence to a Poisson distribution, one only has to prove that
the expectation of the random variable is arbitrarily close to its variance.
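
Le Cam's inequality can be checked numerically on a small example. The sketch
below computes the exact distribution of a sum of independent Bernoulli
variables by convolution, its total variation distance to the Poisson
distribution with the same mean, and the bound $\sum_i P(I_i = 1)^2$. The success
probabilities are arbitrary illustrative values, not data from the paper.

```python
import math


def bernoulli_sum_pmf(probs):
    """Exact pmf of a sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for n, mass in enumerate(pmf):
            nxt[n] += mass * (1.0 - p)
            nxt[n + 1] += mass * p
        pmf = nxt
    return pmf


def poisson_pmf(mean, n):
    """P(Q = n) for a Poisson variable with the given positive mean."""
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def tv_to_poisson(probs):
    """Total variation distance between the sum and the Poisson law with the same mean."""
    pmf = bernoulli_sum_pmf(probs)
    mean = sum(probs)
    inside = sum(abs(pmf[n] - poisson_pmf(mean, n)) for n in range(len(pmf)))
    # Poisson mass beyond the support of the sum, where the sum has no mass.
    tail = 1.0 - sum(poisson_pmf(mean, n) for n in range(len(pmf)))
    return 0.5 * (inside + max(tail, 0.0))


probs = [0.02, 0.05, 0.01, 0.03] * 25
distance = tv_to_poisson(probs)
bound = sum(p * p for p in probs)                 # right-hand side of (3)
mean = sum(probs)
variance = sum(p * (1.0 - p) for p in probs)      # E(X) - Var(X) equals the bound
```

The distance stays below the bound, and the bound coincides with E(X) - Var(X),
as stated in Relation (3).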

4.2. Estimation of the mean value of the observables. We consider the
1-out-of-$k_s$ deterministic sampling technique, where one packet is selected out of every
$k_s$ packets. In addition, we suppose that traffic on the observed link is sufficiently
mixed so that the mixing condition given by Definition [1] holds and that there are
no dominating flows in traffic so that the negligibility condition (Definition [2]) also
holds.

It is assumed that during a time interval of length A, there are K flows composed
of at least $B_{\min}$ packets, where $B_{\min}$ is defined in Section [2]. It has been seen
that the number of packets in these flows follows a Pareto distribution defined
by Relation (1) for some exponent a and parameters $B_{\min}$ and $B_{\max}$. Let S be
a random variable whose distribution is given by Relation (1) for all $x \ge B_{\min}$.
From our experiments, S is the size of a "typical" flow whose size is in the interval
$[B_{\min}, B_{\max}]$. See the discussion at the end of Section for the flows of size greater
than $B_{\max}$. Of course, the sizes of mice are not represented by this random variable.
The variable V denotes the total number of packets in the observation window; note
that it includes not only the elephants but also the mice.

Note that V is the sum of the numbers of packets in elephants and mice. If $v_i$ is
the number of packets in the ith elephant, then $v_i$ has the same Pareto distribution
as S (i.e., $v_i \stackrel{\text{dist.}}{=} S$) and $V \ge v_1 + v_2 + \cdots + v_K$. The difference
$V - v_1 - v_2 - \cdots - v_K$ is the number of packets of mice.

Proposition 1 (Mean Value of the Observables). If K elephants are active in a
time window of length A, the mean number $E(W_j)$ of flows sampled j times, $j \ge 1$,
satisfies the relation

(4)   $\left| \frac{E(W_j)}{K} - Q_j \right| \le p_s E\left( \frac{S^2}{V} \right)$,

where Q is the probability distribution defined by

$P(Q = j) \stackrel{\text{def}}{=} Q_j = E\left( \frac{(p_s S)^j}{j!} e^{-p_s S} \right)$

and $p_s = 1/k_s$ is the sampling rate.

From Equation (4) one gets that the larger the total volume V of packets is, the
better is the approximation of $E(W_j)/K$ by $Q_j$.

Proof. The number of times $\tilde v_i$ that the ith flow is sampled in the time interval is
given by

$\tilde v_i = B_1^i + B_2^i + \cdots + B_{p_s V}^i$,

where, due to the mixing condition, $B_\ell^i$ is equal to one if the $\ell$th sampled packet is
from the ith flow, which event occurs with probability $v_i/V$. Note that the total
number of sampled packets is $p_s V$.

Conditionally on the values of the set $\mathcal{F} = \{v_1, \ldots, v_K\}$, the variables $(B_\ell^i, \ell \ge 1)$
are independent Bernoulli variables. For $1 \le i \le K$, Le Cam's Inequality (3) therefore
gives the relation

$\|P(\tilde v_i \in \cdot \mid \mathcal{F}) - P(Q_{p_s v_i} \in \cdot)\|_{tv} \le p_s \frac{v_i^2}{V}$.

By integrating with respect to the variables $v_1, \ldots, v_K$, this gives the relation

$\|P(\tilde v_i \in \cdot) - P(Q \in \cdot)\|_{tv} \le p_s E\left( \frac{S^2}{V} \right)$.

In particular, for $j \in \mathbb{N}$, $|P(\tilde v_i = j) - Q_j| \le p_s E(S^2/V)$. Since

$E(W_j) = \sum_{i=1}^{K} P(\tilde v_i = j)$,

by summing on $i = 1, \ldots, K$, one gets

$|E(W_j) - K Q_j| \le p_s K E\left( \frac{S^2}{V} \right)$

and the result follows. □

If the number of packets per flow were constant, then Q would be a Poisson
distribution with parameter $p_s S$, the variable S being in this case a constant. The
above inequality shows that at the first order the expected value of $W_j$ is $p_s E(S)$.
The expression of Q, however, indicates that higher order moments of S play a
significant role. For example, if the variable S has a significant variance, then the
classical rough reduction, which consists of assuming that the size of a sampled
elephant is $p_s S$, is no longer valid for estimating the original size of the elephant.



Under the negligibility condition, we deduce that

$\left| \frac{E(W_j)}{K} - Q_j \right| \le p_s \varepsilon$,

where $\varepsilon$ appears in Definition [2] and is assumed to be much less than 1. This implies
that Inequality (4) is tight and the quantity $E(W_j)/K$ can accurately be approximated
by the quantity $Q_j$ when no flows are dominating in traffic.
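
The approximation of $E(W_j)/K$ by $Q_j$ can be illustrated with a small simulation.
The sketch below is only a toy experiment under stated assumptions: elephant
sizes are drawn from a Pareto law truncated to [B_min, B_max], mice are lumped
into a single background of packets, a random shuffle stands in for the mixing
condition, and all numerical values (K = 1000, a = 1.5, k_s = 20, and so on) are
hypothetical choices made so that $v_i^2/V$ stays below 0.1.

```python
import math
import random
from collections import Counter


def truncated_pareto(rng, a, b_min, b_max):
    """Inverse-transform draw from a Pareto law with exponent a on [b_min, b_max)."""
    tail_at_max = (b_min / b_max) ** a
    return b_min * (1.0 - rng.random() * (1.0 - tail_at_max)) ** (-1.0 / a)


def simulate(k_flows=1000, a=1.5, b_min=20, b_max=200,
             mice_packets=400_000, k_s=20, seed=1):
    rng = random.Random(seed)
    sizes = [int(truncated_pareto(rng, a, b_min, b_max)) for _ in range(k_flows)]
    # One list entry per packet: elephants carry their flow index, mice carry -1.
    packets = [i for i, v in enumerate(sizes) for _ in range(v)]
    packets += [-1] * mice_packets
    rng.shuffle(packets)  # crude stand-in for the mixing condition
    # Deterministic 1-out-of-k_s sampling, keeping only elephant packets.
    times_sampled = Counter(p for p in packets[k_s - 1::k_s] if p >= 0)
    w = Counter(times_sampled.values())  # j -> W_j
    p_s = 1.0 / k_s

    def q(j):
        # Empirical version of Q_j = E[(p_s S)^j / j! * exp(-p_s S)].
        return sum(math.exp(-p_s * v) * (p_s * v) ** j for v in sizes) / (
            math.factorial(j) * k_flows)

    return w, q, sizes, len(packets)


w, q, sizes, total = simulate()
for j in (1, 2, 3):
    print(j, w[j] / len(sizes), round(q(j), 4))
```

Each printed line shows j, the simulated ratio $W_j/K$ for one window, and the
corresponding value of $Q_j$. A single window is noisy, so the two columns agree
only up to statistical fluctuations; averaging over several windows would be
needed to approach $E(W_j)/K$.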

We are now ready to state the main result needed for estimating the number K
of elephants from sampled data.

Proposition 2 (Asymptotic Mean Values). Under the same assumptions as those
of Proposition [1],

(5)   $\lim_{K \to +\infty} \frac{E(W_{j+1})}{E(W_j)} = 1 - \frac{a + 1}{j + 1}$

and

(6)   $\lim_{K \to +\infty} \frac{E(W_j)}{K} \sim a (p_s B_{\min})^a \frac{\Gamma(j - a)}{j!}$,

if $B_{\max} \gg 1$ and $p_s B_{\min} \ll 1$, where $\Gamma$ is the classical Gamma function defined
by

$\Gamma(x) = \int_0^{+\infty} u^{x-1} e^{-u} \, du, \quad x > 0$.

Proof. For $j \ge 1$,

$Q_j = E\left( \frac{(p_s S)^j}{j!} e^{-p_s S} \right) \sim a B_{\min}^a \frac{1}{j!} \int_{B_{\min}}^{+\infty} \frac{(p_s u)^j}{u^{a+1}} e^{-p_s u} \, du$

and then

$Q_j \sim a (p_s B_{\min})^a \frac{1}{j!} \int_{p_s B_{\min}}^{+\infty} u^{j-a-1} e^{-u} \, du \sim a (p_s B_{\min})^a \frac{\Gamma(j - a)}{j!}$

since $p_s B_{\min} \sim 0$. Therefore, by using the relation $\Gamma(x + 1) = x \Gamma(x)$ we obtain the
equivalence

$\frac{Q_{j+1}}{Q_j} \sim \frac{j - a}{j + 1}$.

The proposition follows by using the fact that the upper bound of Equation (4) of
Proposition [1] goes to 0 by the law of large numbers. □



As will be seen in the next section, Relation (5) is used to estimate
the exponent a of the Pareto distribution of the number of packets of elephants,
the quantities $E(W_j)$ and $E(W_{j+1})$ being easily derived from sampled traffic. The
quantity K will be estimated from Relation (6). The estimation of the parameter
$B_{\min}$ from sampled traffic as well as the correct choice of the integer j will be
discussed in the next section.






5. Applications 



5.1. Traffic parameter inference algorithm. In this section, it is assumed that
only sampled traffic is available. The methods described in Section [5] to infer the
statistical properties of the flows cannot be applied and another algorithm has to be
defined. For the experiments carried out in the present section, the sampling factor
$p_s = 1/k_s$ has been taken equal to 1/100. To infer flow characteristics, we have
to give the proper definition of the mouse and elephant dichotomy (the parameter
$B_{\min}$) and to estimate the coefficient of the corresponding Pareto distribution (the
parameter a in Relation (1)).

Relation (5) gives the following equivalence, for $j \ge 1$ sufficiently large so that
the impact of mice on $E(W_j)$ is negligible,

(7)   $a \sim a(j) \stackrel{\text{def}}{=} (j + 1) \left( 1 - \frac{E(W_{j+1})}{E(W_j)} \right) - 1$,

and Relation (6) yields an estimate of the number of elephants, i.e. the number of
flows with a number of packets greater than or equal to $B_{\min}$; we specifically have

(8)   $K \sim K(j) \stackrel{\text{def}}{=} \frac{E(W_j) \, j!}{a(j) \, (p_s B_{\min})^{a(j)} \, \Gamma(j - a(j))}$.

These estimations greatly depend on some of the key parameters used to obtain a
convenient and reliable Pareto representation of the size of the flows, in particular
the size of the time window A and the lower bound $B_{\min}$ for the elephants. The
variable A is chosen so that

(1) the number of flows sampled twice is sufficiently large in order to obtain a
significant number of samples so that the estimation of the mean values of
the random variables $W_j$ for $j \ge 2$ is accurate; this requires that A should
not be too small,

(2) A is not too large in order to preserve the unimodal Pareto representation
(see Section [2] for a discussion).

To count the average number of flows sampled j times, the parameter j should be
chosen as large as possible in order to neglect the impact of mice (for which the
Pareto representation does not hold) but not so large that the statistics are no
longer robust enough to compute the mean value $E(W_j)$.

In the experimental work reported below, special attention has been paid to
the choice of the universal constants, i.e., those constants used in the analysis of
sampled data that do not depend on the traffic trace considered. In our opinion,
this is crucial for an accurate inference of traffic parameters from sampled data.
These constants are defined in the algorithm given in Table [5].

Table 5. Algorithm used to identify A and the Pareto parameter
from sampled traffic.

- Choose A so that $80 < E[W_2] < 100$;

- Choose j so that $|a(j) - a(j+1)|$, computed with Equation (7), is minimized
over all j such that $E[W_j] > 5$;

- $B_{\min}$ is the smallest integer so that the probability that a flow of size greater
than $B_{\min}$ is sampled more than j times is greater than $p_s/10$.
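
The estimation steps can be sketched as follows. This is a minimal illustration,
not the implementation used for the experiments: the exponent estimate is
obtained by solving Relation (5) for a, the elephant count by solving Relation (6)
for K, and the selection rules follow the first two steps of Table 5. The
function names and input formats (dictionaries of empirical means) are
assumptions, and the estimation of $B_{\min}$ is not covered.

```python
import math


def estimate_exponent(mean_w_j, mean_w_j1, j):
    """a(j): solve E(W_{j+1}) / E(W_j) = (j - a) / (j + 1) for a."""
    return (j + 1) * (1.0 - mean_w_j1 / mean_w_j) - 1.0


def estimate_elephants(mean_w_j, j, a, p_s, b_min):
    """K(j): solve E(W_j) / K = a (p_s b_min)^a Gamma(j - a) / j! for K (needs j > a)."""
    return mean_w_j * math.factorial(j) / (a * (p_s * b_min) ** a * math.gamma(j - a))


def admissible_windows(mean_w2_by_window, low=80.0, high=100.0):
    """Table 5, first step: windows whose mean W_2 lies strictly between low and high."""
    return [d for d, m in sorted(mean_w2_by_window.items()) if low < m < high]


def choose_j(mean_w, min_count=5.0):
    """Table 5, second step: j minimizing |a(j) - a(j+1)| among j with E[W_j] > min_count.

    mean_w maps j to an estimate of E[W_j]; W_{j+1} and W_{j+2} must be present and
    W_{j+1} must be positive for j to be a candidate. Raises ValueError if none is.
    """
    def gap(j):
        return abs(estimate_exponent(mean_w[j], mean_w[j + 1], j)
                   - estimate_exponent(mean_w[j + 1], mean_w[j + 2], j + 1))

    candidates = [j for j in mean_w
                  if mean_w[j] > min_count and mean_w.get(j + 1, 0) > 0 and j + 2 in mean_w]
    return min(candidates, key=gap)


# ADSL B Down row of Table 7: j = 4, E(W_4) = 9.7, E(W_5) = 4.75.
a_hat = estimate_exponent(9.7, 4.75, 4)
```

For the ADSL B Down values quoted in the comment, the exponent estimate is close
to 1.55. The estimate of K is sensitive to a and to $B_{\min}$ through the factor
$(p_s B_{\min})^a$, so small changes in these inputs shift K(j) noticeably.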
5.2. Experimental results. Concerning the estimation of the constants $B_{\min}$, the
numerical results obtained by using the algorithm given in Table [5] are presented
in Table [6], where the values of the different $B_{\min}$ estimated by the algorithm
are compared against the values given in Section [2]. As can be observed, the
proposed algorithm yields a rather conservative definition of elephants (i.e., flows
of size greater than or equal to $B_{\min}$).

Table 6. Elephants for the France Telecom ADSL and the Abilene
traffic traces.

                          ADSL A   ADSL B Up   ADSL B Down   Abilene A   Abilene B
$B_{\min}$                20       29          39            89          79
estimated $B_{\min}$      21       45          45            77          77



The main results are gathered in Table [7], which gives the quantities K and a estimated
by using Equations (7) and (8) for different values of the parameter j. These
values are compared against the experimental values $a_{\exp}$ and $K_{\exp}$, referred to as
the "real" a and K, obtained from the complete traffic traces in Section [2]. The
accuracy of the estimation of K is generally quite good except for the Abilene
A trace, where the error is significant although not out of bounds. A look at the
corresponding figure in Section [2] gives a plausible explanation for this discrepancy:
for this trace, the Pareto representation is not very precise.

Finally, it is worth noting from Table [7] that the estimation of the important
parameter a describing the statistics of flows is also quite accurate. The error in
this table is defined as

$\frac{|K(j) - K_{\exp}|}{K_{\exp}}$.



Table 7. Estimations of the number of elephants from sampled traffic.

Trace         A     j   E(W_j)   E(W_{j+1})   a_exp   a(j)   K_exp    K(j)      Error
ADSL A        5s    3   12.89    3.33         1.85    1.95   943.71   1031.04   9.25%
ADSL B Down   15s   4   9.7      4.75         1.49    1.55   414.90   404.13    2.59%
ADSL B Up     15s   4   7.46     2.97         1.97    2.00   453.01   462.68    2.13%
ABILENE A     1s    5   6.04     3.21         1.38    1.81   217.44   270.79    24.53%
ABILENE B     1s    5   6.1      3.7          1.36    1.51   209.12   197.12    5.74%
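
As a simple consistency check, the error column can be recomputed from the
values of $K_{\exp}$ and K(j) listed in Table 7. The sketch below assumes that the
error is the relative deviation $|K(j) - K_{\exp}|/K_{\exp}$ expressed in percent; the
variable names are illustrative.

```python
rows = [
    # trace, K_exp, K(j), error in percent as printed in Table 7
    ("ADSL A", 943.71, 1031.04, 9.25),
    ("ADSL B Down", 414.90, 404.13, 2.59),
    ("ADSL B Up", 453.01, 462.68, 2.13),
    ("ABILENE A", 217.44, 270.79, 24.53),
    ("ABILENE B", 209.12, 197.12, 5.74),
]


def relative_error_percent(k_est, k_exp):
    """Relative deviation of the estimate from the experimental value, in percent."""
    return 100.0 * abs(k_est - k_exp) / k_exp


recomputed = {name: relative_error_percent(k_j, k_exp) for name, k_exp, k_j, _ in rows}
for name, _, _, printed in rows:
    print(f"{name}: recomputed {recomputed[name]:.2f}%, printed {printed:.2f}%")
```

The recomputed values agree with the printed ones up to the last displayed
digit, which supports reading the error as an absolute relative deviation.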



Remark. As pointed out by Loiseau et al. [19], the determination of A is crucial.
Recall that it is determined explicitly by the first step of our algorithm; see Table [5].

6. Conclusion 

We have developed in this paper one method of characterizing flows in IP traffic
by a few parameters and another one of inferring these parameters from sampled
data obtained via deterministic 1-out-of-k sampling. For this purpose, we have
made some restrictive assumptions, which are in our opinion essential in order
to establish an accurate characterization of flows. The basic principle we have
adopted consists of describing flows in successive observation windows of limited
length, which has to satisfy two conflicting requirements. On the one hand,
observation windows should not be too large, in order to preserve a description of
flow statistics that is as simple as possible, for instance a description of their
size by means of a simple Pareto distribution.

On the other hand, a sufficiently large number of packets has to be present in
each observation window in order to be able to compute flow characteristics with
sufficient accuracy, in particular the tail of the distribution of the flow size. By
assuming that large flows (elephants) have a size which is Pareto distributed, we
have developed an algorithm to determine the optimal observation window length
together with the parameters of the Pareto distribution. The location parameter
$B_{\min}$ (see Equation (1)) leads to a natural division of the total flow population
into two sets: those flows with at least $B_{\min}$ packets, referred to as elephants, and
those flows with fewer than $B_{\min}$ packets, called mice. This method of characterizing
flows has been tested against traffic traces from the France Telecom and Abilene
networks carrying completely different types of traffic.

For interpreting sampled data, we have made assumptions on the sampling
process. We have specifically supposed that flows are sufficiently interleaved in order
to introduce some randomness in the packet selection process (mixing condition)
and that there are no dominating flows so that there is no bias with regard to the
probability of sampling a flow (negligibility condition). These two assumptions
allow us to establish rigorous results for the number of times an elephant is sampled,
in particular for the mean values of the random variables $W_j$, $j \ge 1$.

Of course, when analyzing sampled data, the original flow statistics are not 
known. In particular, the length of the observation window necessary to character- 
ize the flow size by means of a unique Pareto distribution is unknown. To overcome 
this problem, we have proposed an algorithm to fix the observation window length 
and the minimal length of elephants. Then, by choosing the index j sufficiently large 
so as to neglect the impact of mice, the theoretical results are used to complete the 
flow parameter inference. This method has been tested against the Abilene and the 
France Telecom traffic traces and yields satisfactory results. 